Abstract —Accurate, robust, inexpensive gaze tracking in the car
can help keep a driver safe by facilitating the more effective
study of how to improve (1) vehicle interfaces and (2) the design
of future Advanced Driver Assistance Systems. In this paper, we
estimate head pose and eye pose from monocular video using
methods developed extensively in prior work and ask two new
interesting questions. First, how much better can we classify
driver gaze using head and eye pose versus just using head pose?
Second, are there individual-specific gaze strategies that strongly
correlate with how much gaze classification improves with the
addition of eye pose information? We answer these questions by
evaluating data drawn from an on-road study of 40 drivers. The
main insight of the paper is conveyed through the analogy of an
“owl” and “lizard” which describes the degree to which the eyes
and the head move when shifting gaze. When the head moves
a lot (“owl”), not much classification improvement is attained
by estimating eye pose on top of head pose. On the other hand,
when the head stays still and only the eyes move (“lizard”),
classification accuracy increases significantly from adding in eye
pose. We characterize how that accuracy varies between people,
gaze strategies, and gaze regions.

Index Terms —Head pose estimation, pupil detection, gaze tracking,
driver distraction, driver assistance systems, on-road study.


I. Introduction 

The classification of driver visual attention allocation is an area
of increasing relevance in the pursuit of accident reduction.
The allocation of visual attention away from the road has
been linked to accident risk and a drop in situational
awareness as uncertainty in the environment increases [?].
Driver distraction is often construed as a key source of attention
divergence from the roadway and is the topic of numerous
scientific studies and design guidelines [?], [?].

Furthermore, as the level of vehicle automation continues
to increase through Advanced Driver Assistance Systems as
well as other higher forms of automation, freeing available
resources from the primary operational task, drivers are
expected to be increasingly allowed to glance away from the
roadway for greater periods. When the need arises to orient
the driver to the roadway, different alerting strategies may
be advantageous. Such work would suggest that a real-time
estimation of the driver’s gaze could be coupled with an alerting
system to enhance safety [?]. Gaze tracking from video in
the driving context is a difficult problem due especially to
rapidly varying lighting conditions. Other challenges, common
to other domains, include unpredictability of the environment,
presence of eyeglasses or sunglasses occluding the eye, partial
occlusion of the pupil due to squinting, vehicle vibration,
image blur, poor video resolution, etc. We consider the
challenging case of uncalibrated monocular video because it has
been and continues to be the most commonly available form
of video in driving datasets due to its low equipment and
installation costs.

From the perspective of image processing, gaze estimation can
be divided into two components: head pose estimation and
eye pose estimation. Due to all the factors above, the latter
is more difficult than the former. In fact, gaze classification
performance can be good based on head pose alone [?],
because it frequently correlates with eye pose, but not always.
“Eye pose” and “head pose” are terms used throughout this
paper to mean the relative orientation of the pupil in the eye
socket and the relative orientation of facial features on the
head, respectively. This use of “pose” is made broader in order
to allow for the nonlinear modeling discussed in §IV-B.

In this paper, we seek to characterize when eye pose
significantly contributes to gaze classification and when it does
not. Specifically, we ask two questions:

1) Contribution of Eye Pose: How much better can we
classify driver gaze using (a) head and eye pose together
versus (b) using head pose alone?

2) Classification of Different Gaze Strategies: Are there
individual-specific gaze strategies that strongly correlate
with how much gaze classification improves with the
addition of eye pose information?

These two questions are answered by analyzing data drawn
from an on-road study of 40 drivers performing secondary
tasks of varying complexity. The inter-person variation in
classification and gaze strategy is discussed using the analogy
of an “owl” and “lizard” (introduced previously in [?], [?]) which
describes the degree to which the eyes and the head move
when shifting gaze. When the head moves a lot (“owl”), not
much classification improvement is attained by estimating eye
pose on top of head pose. On the other hand, when the head
stays still and only the eyes move (“lizard”), classification
accuracy increases significantly from adding in eye pose.







"Owl" gaze strategy: Using the head (in addition to eyes) to allocate attention. 
"Lizard" gaze strategy: Using the eyes to allocate attention without moving the head.

Fig. 1: Examples of gaze strategies that explain the “owl” and “lizard” analogy for head and eye pose. The “owl” gaze strategy
involves primarily head movement, while the “lizard” gaze strategy involves primarily eye movement. The spectrum between
these two is discussed in §IV-B.


Examples of the two strategies are shown in Fig. 1. We
propose an end-to-end driver gaze classification system based
on monocular video and use it to explore the importance of
eye pose for classification performance as we move along the
spectrum of people from “owl” to “lizard”.

II. Related Work 

The problem of gaze tracking from monocular video has been
investigated extensively across many domains [10], [11]. We
build on this work to characterize the individual contribution
of head movement and eye movement to gaze classification
accuracy. The building blocks of our image processing pipeline
are face alignment, head pose estimation, and pupil detection.
We apply cutting-edge algorithms from these fields to answer
the two questions posed by our work (see §I) on a large on-road
driving dataset.

The algorithm in [?] uses an ensemble of regression trees
for super-real-time face alignment. Our face feature extraction
algorithm draws upon this method as it is built on a decade of
progress on the face alignment problem (see [12] for a detailed
review of prior work). The key contribution of the algorithm is
an iterative transform of the image to a normalized coordinate
system based on the current estimate of the face shape. Also,
to avoid the non-convex problem of initially matching a model
of the shape to the image data, the assumption is made that the
initial estimate of the shape can be found in a linear subspace.


Head pose estimation has a long history in computer vision.
Murphy-Chutorian and Trivedi [13] describe 74 published and
tested systems from the last two decades. Generally, each
approach makes one of several assumptions that limit the
general applicability of the system in driver gaze detection.
These assumptions include: (1) the video is continuous, (2)
the initial pose of the subject is known, (3) there is a stereo
vision system available, (4) the camera has a frontal view of
the face, (5) the head can only rotate on one axis, (6) the
system only has to work for one person. While the development
of a set of assumptions is often necessary for the classification
of a large number of possible poses, our approach skips the head
pose estimation step (i.e., the computation of a vector in 3D
space modeling the orientation of the head) and goes straight
from the detection of facial features to a classification of gaze
to one of six glance regions. Prior work has shown that such a
classification set is sufficient for the in-vehicle environment,
even under rapidly shifting lighting conditions [17].

Pupil detection approaches have been extensively studied.
Methods usually track corneal reflection, distinct pupil shape
in combination with edge-detection, characteristic light
intensity of the pupil, or a 3D model of the eye to derive an estimate
of an individual’s pupil, iris, or eye position [?]. Our approach
uses an adaptive CDF-based method [?] in conjunction with
face alignment that significantly narrows the search space.

Studies of the correlation between head and eye movement
have shown inter-person variation in the degree to which the
head serves as a proxy for gaze [?], [?]. For example, a
previous work tested drivers’ head movements while looking
at the “road” and the “center stack” and found that: (1)
drivers’ horizontal range of head movements varied (from 5
to 20 degrees) across individuals along with (2) their mean
differences of horizontal head angles while looking at the
two objects (from 0 to 10 degrees) [?]. This paper makes
this variation more explicit by characterizing classification
performance with and without eye pose information.

III. Dataset 

Training and evaluation are carried out on a dataset of 40
subjects drawn from a larger driving study of 80 subjects that
took place on a local interstate highway (see [?] for detailed
experimental methods). For each subject, the collection of
data was carried out in one of two vehicles: a 2013 Chevrolet
Equinox or a Volvo XC60 (randomly assigned). The drivers
performed a number of secondary tasks of varying difficulty,
including using the voice interface in the vehicle to enter
addresses into the navigation system and using the voice
interface as well as manual controls to select phone numbers
from a stored phone list.

Both vehicles were instrumented with an array of sensors for
assessing driver behavior. The sensor set included a camera
positioned on the dashboard of each vehicle that was intended
to capture the driver’s face for annotation of glance behavior.
The cameras were positioned off-axis to the driver and in
slightly different locations in the two vehicles (based upon
features of the dashboard, etc.). As each driver positioned
the seat (electronic in both vehicles) differently, the relative
position of the driver in relation to the camera varied somewhat
by subject and across each driver over time (i.e., drivers move
continuously in the seat, etc.). The camera was an Allied
Vision Tech Guppy Pro F-125, capturing grayscale images
at a resolution of 800x600 and a speed of 30 fps. The data
was double manually annotated for driver glance transitions
during secondary task periods (at a resolution of sub-200 ms)
into one of 11 classes (road, center stack, instrument cluster,
rearview mirror, left, right, left blindspot, right blindspot,
passenger, uncodable, and other). As detailed in [?], any
discrepancies between the two annotators were mediated by
an arbitrator. This method of double annotation and mediation
of driver gaze has been shown to produce very accurate
annotations that can be effectively used as ground truth for
supervised learning approaches [?].


Pruning Steps                    Total Frames Remaining   Fraction of Original
0. Total Frames Annotated        1,351,864                100%
1. Frames with Faces Detected    1,073,380                79.4%
2. Frames with Pupils Detected   833,049                  61.6%

TABLE I: Dataset statistics for the total number of video
frames annotated, the number of frames where faces were
detected, and the number of frames where pupils were detected.
Each of these pruning steps is discussed in §IV.


In this paper, a broad random subset of data was drawn from
the initial experiment. The “left” and “left blind spot”
classes were collapsed into “left”, and the “right”, “right
blind spot”, and “passenger” classes were collapsed into
“right”. Periods that were labeled “uncodable” and “other”
were excluded. Subject pruning was completed to ensure that
every subject under consideration has sufficient training data
for each of the six glance regions (road, center stack,
instrument cluster, rearview mirror, left, and right). The
threshold for “sufficient training” was that each subject had
at least 120 frames of video (where pupils were detected) for
each of the six gaze regions.
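
The class collapsing and subject pruning just described can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the label strings, the names `COLLAPSE`, `collapse_label`, and `has_sufficient_data`, and the list-of-labels input format are all assumptions made for the example. Only the mapping itself and the 120-frame threshold come from the text.

```python
from collections import Counter

# Mapping from annotation classes to the six glance regions.
# The label strings are assumed spellings, not identifiers from the study.
COLLAPSE = {
    "road": "road",
    "center stack": "center stack",
    "instrument cluster": "instrument cluster",
    "rearview mirror": "rearview mirror",
    "left": "left",
    "left blind spot": "left",
    "right": "right",
    "right blind spot": "right",
    "passenger": "right",
}
EXCLUDED = {"uncodable", "other"}
REGIONS = sorted(set(COLLAPSE.values()))
MIN_FRAMES = 120  # per-region threshold stated in the text


def collapse_label(label):
    """Return the six-region label, or None for an excluded class."""
    if label in EXCLUDED:
        return None
    return COLLAPSE[label]


def has_sufficient_data(frame_labels, min_frames=MIN_FRAMES):
    """True if a subject has at least min_frames frames in every region.

    frame_labels: annotation labels of the subject's frames in which a
    pupil was detected.
    """
    collapsed = (collapse_label(label) for label in frame_labels)
    counts = Counter(region for region in collapsed if region is not None)
    return all(counts[region] >= min_frames for region in REGIONS)
```

A subject is kept only when `has_sufficient_data` returns True for that subject's pupil-detected frames.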


As shown in Table I, the resulting dataset contains 1,351,864
images, each annotated as belonging to one of six glance
regions. The algorithm described in §IV is used for face
detection, face alignment, and pupil detection. The gaze
classification approach requires a face and a pupil to be
successfully detected in the image; the filtering procedure is
discussed in detail in §IV. Therefore, in the evaluation we
include only the images where a face and a pupil are detected.
As the table shows, a face is detected in 79.4% of images, and
61.6% of images pass the full image processing pipeline, where
both a face and a pupil are detected.
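
As a quick arithmetic check, the percentages quoted above follow directly from the frame counts in Table I. The variable names below are chosen for this illustration only. The third ratio is not stated in the text; it is derived here from the same counts.

```python
# Frame counts from Table I.
total_annotated = 1_351_864
faces_detected = 1_073_380
pupils_detected = 833_049

# Share of annotated frames with a detected face.
face_fraction = faces_detected / total_annotated
# Share of annotated frames that pass both face and pupil detection.
pupil_fraction = pupils_detected / total_annotated
# Share of face-detected frames that also pass pupil detection.
pupil_given_face = pupils_detected / faces_detected

print(f"{face_fraction:.1%} {pupil_fraction:.1%} {pupil_given_face:.1%}")
```

The first two values reproduce the 79.4% and 61.6% figures. The third shows that roughly three quarters of the frames with a detected face also pass pupil detection.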


IV. Gaze Classification Pipeline 

The steps in the gaze region classification pipeline are: (1) face
detection, (2) face alignment, (3) pupil detection, (4) feature
extraction and normalization, (5) classification, and (6) decision
pruning. If an image passes the first three steps, the pipeline
produces a gaze region classification decision for that image.
In step 6, that decision may be dropped if it falls below a
confidence threshold (see §IV-E). The three face images in
Fig. 2 are examples of the result achieved in the first four
steps of the pipeline: going from a raw video frame to extracted
face features and pupil position. As mentioned in §I, the
relative orientation of facial features serves as a proxy
for “head pose” and the relative orientation of pupil position
serves as a proxy for “eye pose”. We discuss each of the six
steps in the pipeline in the following sections.

Fig. 2: Three example images showing the results of the feature extraction and the intermediate steps of the pupil detection.
The black dots designate facial landmarks and the single white dot designates the pupil position in the right eye. There are 58
facial landmarks shown with 10 inner-mouth landmarks removed in the visualization for the purpose of visual clarity. Below
each of the three face images are 6 steps of the pupil detection described in §IV-C.

A. Face Detection 

The environment inside the car is relatively controlled in that 
the camera position is fixed and the driver’s torso moves in
a fairly contained space. Thus, a camera can be positioned 
such that the driver’s face is always fully or almost fully 
in the field of view. However, the lighting conditions are 
sometimes drastically variable (e.g. quickly passing under a 
bridge, reflection of the sun on the camera lens, etc.) and thus 
there are frequently cases where the intensity distribution of 
the image does not allow for successful detection of the face 
(i.e. false reject). Every detection step in the pipeline is tuned 
to have a low False Accept Rate (FAR). A false accept error 
early in the pipeline propagates and can result in drastically 
incorrect head pose and eye pose estimation. In the context 
of video-based driver gaze classification, a high False Reject 
Rate (FRR) is more acceptable than a high FAR. 

The face detector in our pipeline uses a Histogram of Oriented
Gradients (HOG) combined with a linear SVM classifier,
an image pyramid, and a sliding window detection scheme
implemented in the DLIB C++ library [18]. This detector has
a lower FAR than the widely-used default Haar-feature-based
face detector available in OpenCV [19] and thus is more
appropriate for our application.


B. Face Alignment and Head Pose 

Both face alignment and head pose estimation are extremely
well studied problems in computer vision [13], [?]. We
investigated several cutting-edge methods from each domain,
and chose the ones that worked best for monocular video with
highly varying lighting conditions.

Face alignment in our pipeline is performed on a 68-point
Multi-PIE facial landmark mark-up used in the iBUG 300-W
dataset [?]. These landmarks include parts of the nose, upper
edge of the eyebrows, outer and inner lips, jawline, and parts in
and around the eye. The selected landmarks are shown as black
dots in Fig. 2. The algorithm for aligning the 68-point shape
to the image data uses a cascade of regressors as described
in [?] and implemented in [18]. The two characteristics of
this algorithm most important to driver gaze localization are:
(1) it is robust to partial occlusion and self-occlusion and (2)
its running time is significantly faster than the 30 fps rate of
incoming images.

Face alignment produces estimates for facial feature positions
in the image. These features can be mapped directly to a gaze
region using methods that fall under the Nonlinear Regression
category defined in [13]. They can also be mapped to a 3D
model of the head. The resulting 3D-2D point correspondence
can be used to compute the orientation of the head. This is
categorized under Geometric Methods in [13]. Then the yaw,
pitch, and roll of the head can be used as features for a gaze
region classifier. We implemented both methods and found the
former (nonlinear classification) to be more robust to errors in
the face alignment and pupil detection steps of the pipeline.
The geometric approach uses OpenCV’s SolvePnP solution of
the PnP problem [22]. The nonlinear classification approach
is discussed further in §IV-E.


C. Pupil Detection

As described in §I, the problem of accurate pupil detection
is more difficult than the problem of accurate face alignment,
and neither is always robust to poor lighting conditions.
Therefore, the secondary task of pupil detection is to flag
errors in the face alignment step that preceded it. As Table I
shows, the face is detected in 79.4% of video frames but only
61.6% of the original frames pass the pupil detection step.

We use a CDF-based method [?] to extract the pupil from
the image of the right eye, and adjust the extracted pupil blob
using morphological operations of erosion and dilation. The
six steps in this process are as follows:

1) Extract the right eye from the face image based on the
right eye features computed as part of the face alignment
step.

2) Remove all pixels that fall outside the boundaries of the
polygon defined by the 6 eye features.

3) Rescale the intensity such that the 98th-percentile intensity
becomes 1.0 and the 2nd-percentile intensity becomes 0.0.

4) Define a CDF intensity threshold and convert the
grayscale image to a binary image. Each pixel intensity
above the threshold becomes 1, and otherwise becomes 0.

5) Perform an “opening” morphology transformation
(described in [23]). This operation is useful for removing
small holes in large blobs.

6) Perform a “closing” morphology transformation
(described in [23]). This operation is useful for removing
small objects and smoothing the shape of large blobs.

The above steps have three parameters: the CDF threshold,
the opening window size, and the closing window size. These
parameters are dynamically optimized for each image over
a discrete set of values in order to maximize the size of the
largest resulting blob under one constraint: the largest blob
must be circle-shaped (i.e., have similar height and width).
More specifically, each of the 3 parameters takes on 3 values
and, using exhaustive search over the resulting 27 combinations,
we find the set of parameter values that results in the largest
circular blob.

The pupil detection process also includes pruning procedures
based on whether the eye is sufficiently open and whether
there is a possible error in the preceding face alignment step.
These are:

1) An eye shape height that is less than 10% of its width is
considered “closed” and is removed from the pipeline.

2) When a sufficiently large blob is not found in the eye
region (i.e., the largest blob is less than 5 pixels in area),
it is assumed that the face alignment did not properly
localize the eye and the image is removed from the pipeline.

D. Feature Extraction and Normalization

The driver spends more than 90% of their time looking forward
at the road and this fact was used in [?] to normalize the
position of facial features relative to the average bounding
box of the face associated with the “Road” gaze region. This
required an initial 120-second period of automated calibration.
In this paper, we remove the need for calibration and instead
normalize the facial features based on the bounding box of
the eyes and nose for the current frame only. Fig. 1 shows
departure of the head and eyes away from their “reference”
positions. The normalization step linearly transforms the facial
landmarks such that the landmarks of the eyes and nose fit a
unit square. After this transformation, the relative orientation
of the facial landmarks becomes the feature vector for the gaze
classification step. The bounding box of the eyes and nose
was experimentally found to be the most robust normalizing
region. This is due to the fact that the greatest noise in the
face alignment step was associated with the features of the
jawline, the eyebrows, and the mouth. The position of the pupil
is normalized to the bounding box of the eye rotated such that
the two eye corners lie on a horizontal line.
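
The normalization described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation. The function names, the argument layout, and the choice to scale each axis independently so that the eye-and-nose bounding box becomes the unit square are assumptions. Which landmark indices belong to the eyes and nose depends on the landmark scheme and is left to the caller.

```python
import math


def normalize_landmarks(landmarks, reference_indices):
    """Map landmarks so the reference bounding box becomes the unit square.

    landmarks: list of (x, y) image coordinates.
    reference_indices: indices of the eye and nose landmarks.
    Assumes independent scaling per axis and a non-degenerate box.
    """
    xs = [landmarks[i][0] for i in reference_indices]
    ys = [landmarks[i][1] for i in reference_indices]
    x0, y0 = min(xs), min(ys)
    width, height = max(xs) - x0, max(ys) - y0
    return [((x - x0) / width, (y - y0) / height) for x, y in landmarks]


def normalize_pupil(pupil, eye_points, corner_a, corner_b):
    """Express the pupil position relative to the rotated eye bounding box.

    The eye is rotated about corner_a so that corner_a and corner_b lie
    on a horizontal line; the pupil is then scaled by the bounding box
    of the rotated eye points.
    """
    angle = math.atan2(corner_b[1] - corner_a[1], corner_b[0] - corner_a[0])
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    def rotate(point):
        dx, dy = point[0] - corner_a[0], point[1] - corner_a[1]
        return (dx * cos_a - dy * sin_a, dx * sin_a + dy * cos_a)

    rotated = [rotate(p) for p in eye_points]
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    px, py = rotate(pupil)
    return ((px - min(xs)) / (max(xs) - min(xs)),
            (py - min(ys)) / (max(ys) - min(ys)))
```

Landmarks outside the reference box, such as the jawline, map to coordinates outside the range 0 to 1. That is consistent with using the eyes and nose only as the normalizing region rather than as the full feature set.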


E. Classification and Decision Pruning

A scikit-learn implementation of a random forest classifier [24]
is used to generate a set of probabilities for each class from
a single feature vector. The probabilities are computed as the
mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of
the same class in a leaf. A random forest classifier of depth
25 with an ensemble of 2,000 trees is used for all experiments
in §V. The class with the highest probability is the one that
the system assigns to the image as the “decision”. The ratio
of the highest probability to the second highest probability is
termed the “confidence” of the decision. A confidence of 1
is the minimum. There is no maximum. For the experiments
in §V, a confidence threshold of 10 is used, which means that
any decisions with a confidence greater than 10 are accepted and
the others are ignored. The effect of this threshold is explored
in [?]. A random forest classifier was used
because it achieved a much higher accuracy than k-nearest
neighbors (KNN) and linear SVM classifiers. An RBF-kernel SVM
achieved a slightly higher accuracy but at the cost of over a
100-fold increase in training time.
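
The decision and confidence rule described above can be written compactly. The sketch below is an assumption-laden illustration rather than the authors' code: the function name `decide`, the dictionary input, and the treatment of a zero runner-up probability as infinite confidence are choices made for this example. The class probabilities would come from the classifier, for instance from the `predict_proba` output of a scikit-learn `RandomForestClassifier`.

```python
import math


def decide(class_probabilities, threshold=10.0):
    """Return (decision, confidence, accepted) for one image.

    class_probabilities: dict mapping gaze region -> probability.
    threshold: minimum accepted confidence (10 in the experiments).
    """
    ranked = sorted(class_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    (decision, best), (_, runner_up) = ranked[0], ranked[1]
    # Ratio of the two highest probabilities. It is at least 1 and has
    # no upper bound, so a runner-up probability of zero is treated
    # here as infinite confidence.
    confidence = best / runner_up if runner_up > 0 else math.inf
    return decision, confidence, confidence > threshold
```

For example, probabilities of 0.9 for “road” and 0.05 for each of two other regions give a confidence of about 18, so the decision is accepted. Probabilities of 0.5 and 0.3 give a confidence below 2, and the decision is dropped.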


V. Results

A. Gaze Region Classification

We evaluate the gaze classification pipeline described in §IV
on the dataset of 40 drivers described in §III. In all the
experiments and discussions that follow, the key comparison
is between classification performed using head pose alone
and classification performed using head pose and eye pose
together. The classification problem has six classes, one for
each of the six gaze regions: (1) road, (2) center stack, (3)
instrument cluster, (4) rearview mirror, (5) left, and (6) right.

The pipeline starts at an annotated frame from the raw video.
As previously described, each frame is double annotated and
mediated, ensuring that the gaze region annotations can reliably
serve as ground truth for the cross-validation training and
testing. There are a total of 1,351,864 annotated images. As
shown in Table I, 833,049 of those images pass through the
face detection, face alignment, and pupil detection steps of the
pipeline. These steps include the removal of frames in which
the eye is considered “closed”, i.e., frames where the eye shape
height is less than 10% of its width. As discussed in §IV-E, we
further reduce this number during testing by only considering
decisions with a confidence above the confidence threshold of
10. On average, only 7.1% of all decisions are deemed confident
in this way, resulting in a decision rate of 2.3 Hz. This
selection is distributed evenly through time among cases where
a face is successfully detected. If we consider the fraction of
original raw video frames that lead to a confident gaze
classification decision, then the overall effective decision
rate is 1.3 Hz.

All of the plots in this section share the same experimental
setup. For each user, we train the 6-class classifier on all
39 other users. The training data for each of the 6 classes
is balanced by random sub-sampling [25]. The testing is
performed on the data for the one user by balancing the classes
through super-sampling. This helps ensure that the per-class
accuracy is not skewed by the greater representation of “Road”
versus the other five classes in the dataset. The process is
repeated 100 times for each of the 40 users. The plots with
error bars indicate the standard deviation of accuracy among
the 100 runs for each user.

Fig. 3 shows the confusion matrix for classification using both
head pose and eye pose. The overall accuracy achieved is
94.6%. Most of the errors in classification are in incorrectly
labeling an image as “Road” when it is one of the other 5
gaze regions. Fig. 4 compares the accuracy in this confusion
matrix with that achieved by a system that only uses head pose
information. The overall accuracy achieved by such a system
is 89.2%. One of the questions posed by this paper is: how
much do we gain by considering eye pose on top of head pose?
The answer in our final optimized system is 5.4% accuracy.
As Fig. 4 shows, the biggest gain of 8.7% is achieved for the
center stack region. This can be interpreted to mean that people
are more likely to use only their eyes when glancing down
to the center stack or that the head pose associated with the
center stack is similar to the head pose of other gaze regions
like “Road”, “Instrument Cluster”, and “Rearview Mirror” as
Fig. 3 suggests.

Fig. 3: Confusion matrix for the six-region classification
problem using both head pose and eye pose information.
The overall accuracy is 94.6%. The confidence threshold is
set to 10, resulting in an average confident decision rate of 2.3
times a second. When considering that only 61.6% of frames
pass the face detection and pupil detection steps, the effective
overall decision rate is 1.3 times a second.

Fig. 4: Increase in accuracy per gaze region achieved by using
eye pose in addition to head pose. The increase results in the
confusion matrix shown in Fig. 3.

B. User-Specific Classification and Gaze Strategies

As shown in §V-A, adding in eye pose to head pose increases
gaze classification accuracy by 5.4%. But that doesn’t tell the
full story because some user-trained classifiers benefit more
from eye pose than others. Fig. 5 shows the variation in
accuracy among users before and after adding in eye pose to
the classification feature set. For many users, 100% accuracy
is achieved, while for many others accuracy drops to below
80% and even to as low as 40%.

(a) Head pose alone. Average accuracy: 89.2%

(b) Head pose and eye pose. Average accuracy: 94.6%

Fig. 5: Per-user accuracy in increasing order for a confidence threshold of 10 and resulting decision rate of 2.31 times a second.
The difference between the two is explored further in Fig. 6b.

(a) Increase in per-user accuracy versus the “owlness” metric
which measures the fraction of the attention shift that is due to
head movement versus eye movement. There are 40 points on
this plot and each represents the average increase in accuracy
achieved and the average measure of “owlness” for the user.

(b) Increase in per-user accuracy partitioned by two threshold
values in the “owlness” metric. The users with the high
“owlness” measures see zero or less improvement from adding
in eye pose. The users with the low “owlness” measures see
positive improvement from adding in eye pose.

Fig. 6: The “owlness” metric and its correlation with the increase in per-user accuracy from the first histogram plot to the
second one in Fig. 5.

Using the Pearson correlation coefficient as a guide, we
programmatically explored over a million pairs of variables
in search of an answer to the question of what explains this
difference in classification performance between users. Some
variables correlate with per-region accuracy but not overall.
For example, average magnitude of off-center head movement
and pupil movement are good predictors of classification
accuracy for the “Right” region with head pose alone and with
head and eye pose together, respectively. We were not able to
find a measure of an individual that correlated highly with
overall classification accuracy, but there are a few variables
that correlate with the increase in accuracy achieved by adding
in eye pose. The most interesting and intuitive one is a metric
we refer to as “owlness”. It is defined as:

$$M = \frac{d_h}{d_h + d_p} \qquad (1)$$

where $d_h$ and $d_p$ are the distance of the nose tip and right
pupil, respectively, from their average position in the
background model. Due to the normalization of the features, both
distances are in the range $[0, \sqrt{2}]$. An M value of 0 means
that a shift in gaze involves only the eyes (“lizard”). An M
value of 1 means that a shift in gaze involves only the head
(“owl”).
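
A per-frame computation of the metric can be sketched as below. The sketch reads (1) as the ratio M = d_h / (d_h + d_p), which matches the stated endpoints: M is 0 when only the eyes move and 1 when only the head moves. The function name, the tuple-based inputs, and the convention of returning None when neither the head nor the eye has moved are assumptions for this illustration.

```python
import math


def owlness(nose_tip, pupil, nose_tip_mean, pupil_mean):
    """Owlness M = d_h / (d_h + d_p) for one frame.

    All arguments are (x, y) positions in normalized coordinates; the
    *_mean arguments are the average positions from the background
    model. Returns None when both displacements are zero, because the
    ratio is then undefined (a convention assumed here).
    """
    d_h = math.dist(nose_tip, nose_tip_mean)  # head displacement
    d_p = math.dist(pupil, pupil_mean)        # eye displacement
    if d_h + d_p == 0:
        return None
    return d_h / (d_h + d_p)
```

A frame in which the nose tip stays at its average position while the pupil moves yields 0. A frame in which the nose tip moves and the pupil stays at its average position yields 1.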


Fig. 6a shows the relationship between the “owlness” metric
and the per-user increase in accuracy achieved. The measure
of “owlness” for each user is computed by averaging the result
of (1) for each image that passes the face detection and pupil
detection steps in the pipeline. We partition users into three
groups: “owl”, “lizard”, and “mixed” based on the value of
M. Fig. 6b shows how effectively these partitions separate
the users who gain classification accuracy from eye pose and
those who do not. In this figure, the “owls” see no effect or
a decrease in accuracy, while the “lizards” see a significant
increase in accuracy.
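
The per-user averaging and three-way partition can be illustrated as follows. The text does not state the two threshold values used for the partition, so the cut-offs of 0.4 and 0.6 below are placeholders chosen purely for illustration. The function names are likewise hypothetical.

```python
def mean_owlness(per_frame_values):
    """Per-user owlness: the average of the per-frame M values."""
    return sum(per_frame_values) / len(per_frame_values)


def gaze_strategy(m, lizard_below=0.4, owl_above=0.6):
    """Label a user by mean owlness m.

    The two cut-off values are placeholders; the thresholds actually
    used for the partition are not given in the text.
    """
    if m < lizard_below:
        return "lizard"
    if m > owl_above:
        return "owl"
    return "mixed"
```

With these placeholder thresholds, a user whose frames average an M of 0.2 would be labeled “lizard”, and one averaging 0.9 would be labeled “owl”.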

VI. Conclusion 

This paper investigates the contribution of head pose and 
eye pose to gaze classification accuracy for different gaze 
strategies. We answer two questions: (1) how much does 
eye pose contribute and (2) how can the inter-user accuracy 
variation be explained? For the former, we show that eye pose 
adds a 5.4% increase in average accuracy (from 89.2% to 
94.6%) with an effective average rate of 1.3 decisions per 
second. For the latter, we propose an “owlness” metric that 
decomposes gaze into head movement and eye movement and 
computes the relative magnitude of each. This metric is used 
to explain the inter-person variation in impact of eye pose on 
gaze classification accuracy. 